Import Libraries¶

In [1]:
import pandas as pd
import plotly.express as px

Explore the Dataset¶

In [2]:
# Load the dataset
df = pd.read_csv("US_Accidents_March23.csv")
In [3]:
# Display the first few rows of the dataset
df.head()
Out[3]:
ID Source Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) ... Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-1 Source2 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN 0.01 ... False False False False False False Night Night Night Night
1 A-2 Source2 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN 0.01 ... False False False False False False Night Night Night Day
2 A-3 Source2 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN 0.01 ... False False False False True False Night Night Day Day
3 A-4 Source2 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN 0.01 ... False False False False False False Night Day Day Day
4 A-5 Source2 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN 0.01 ... False False False False True False Day Day Day Day

5 rows × 46 columns

In [4]:
# Data shape
df.shape
Out[4]:
(7728394, 46)
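With 7.7 million rows, repeated full reloads are slow while exploring. One common trick (a sketch, not part of the original run; the tiny in-memory CSV below is a hypothetical stand-in for US_Accidents_March23.csv) is to prototype against a row-limited read with `nrows`:

```python
import io

import pandas as pd

# Hypothetical stand-in for US_Accidents_March23.csv
csv_data = io.StringIO(
    "ID,Severity,State\n"
    "A-1,3,OH\nA-2,2,OH\nA-3,2,CA\nA-4,4,TX\nA-5,2,CA\n"
)

# Read only the first N rows while prototyping; drop nrows for the full run
sample = pd.read_csv(csv_data, nrows=3)
print(sample.shape)  # (3, 3)
```

Once the pipeline works on the sample, the same code can be rerun on the full file.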
In [5]:
# Data Types and Missing Values 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 46 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Source                 object 
 2   Severity               int64  
 3   Start_Time             object 
 4   End_Time               object 
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object 
 11  Street                 object 
 12  City                   object 
 13  County                 object 
 14  State                  object 
 15  Zipcode                object 
 16  Country                object 
 17  Timezone               object 
 18  Airport_Code           object 
 19  Weather_Timestamp      object 
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)            float64
 23  Pressure(in)           float64
 24  Visibility(mi)         float64
 25  Wind_Direction         object 
 26  Wind_Speed(mph)        float64
 27  Precipitation(in)      float64
 28  Weather_Condition      object 
 29  Amenity                bool   
 30  Bump                   bool   
 31  Crossing               bool   
 32  Give_Way               bool   
 33  Junction               bool   
 34  No_Exit                bool   
 35  Railway                bool   
 36  Roundabout             bool   
 37  Station                bool   
 38  Stop                   bool   
 39  Traffic_Calming        bool   
 40  Traffic_Signal         bool   
 41  Turning_Loop           bool   
 42  Sunrise_Sunset         object 
 43  Civil_Twilight         object 
 44  Nautical_Twilight      object 
 45  Astronomical_Twilight  object 
dtypes: bool(13), float64(12), int64(1), object(20)
memory usage: 2.0+ GB
In [6]:
# Summary Statistics
df.describe()
Out[6]:
Severity Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Speed(mph) Precipitation(in)
count 7.728394e+06 7.728394e+06 7.728394e+06 4.325632e+06 4.325632e+06 7.728394e+06 7.564541e+06 5.729375e+06 7.554250e+06 7.587715e+06 7.551296e+06 7.157161e+06 5.524808e+06
mean 2.212384e+00 3.620119e+01 -9.470255e+01 3.626183e+01 -9.572557e+01 5.618423e-01 6.166329e+01 5.825105e+01 6.483104e+01 2.953899e+01 9.090376e+00 7.685490e+00 8.407210e-03
std 4.875313e-01 5.076079e+00 1.739176e+01 5.272905e+00 1.810793e+01 1.776811e+00 1.901365e+01 2.238983e+01 2.282097e+01 1.006190e+00 2.688316e+00 5.424983e+00 1.102246e-01
min 1.000000e+00 2.455480e+01 -1.246238e+02 2.456601e+01 -1.245457e+02 0.000000e+00 -8.900000e+01 -8.900000e+01 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 2.000000e+00 3.339963e+01 -1.172194e+02 3.346207e+01 -1.177543e+02 0.000000e+00 4.900000e+01 4.300000e+01 4.800000e+01 2.937000e+01 1.000000e+01 4.600000e+00 0.000000e+00
50% 2.000000e+00 3.582397e+01 -8.776662e+01 3.618349e+01 -8.802789e+01 3.000000e-02 6.400000e+01 6.200000e+01 6.700000e+01 2.986000e+01 1.000000e+01 7.000000e+00 0.000000e+00
75% 2.000000e+00 4.008496e+01 -8.035368e+01 4.017892e+01 -8.024709e+01 4.640000e-01 7.600000e+01 7.500000e+01 8.400000e+01 3.003000e+01 1.000000e+01 1.040000e+01 0.000000e+00
max 4.000000e+00 4.900220e+01 -6.711317e+01 4.907500e+01 -6.710924e+01 4.417500e+02 2.070000e+02 2.070000e+02 1.000000e+02 5.863000e+01 1.400000e+02 1.087000e+03 3.647000e+01

Data Cleaning and EDA¶

In [7]:
# Check null counts before cleaning the data
df.isnull().sum()
Out[7]:
ID                             0
Source                         0
Severity                       0
Start_Time                     0
End_Time                       0
Start_Lat                      0
Start_Lng                      0
End_Lat                  3402762
End_Lng                  3402762
Distance(mi)                   0
Description                    5
Street                     10869
City                         253
County                         0
State                          0
Zipcode                     1915
Country                        0
Timezone                    7808
Airport_Code               22635
Weather_Timestamp         120228
Temperature(F)            163853
Wind_Chill(F)            1999019
Humidity(%)               174144
Pressure(in)              140679
Visibility(mi)            177098
Wind_Direction            175206
Wind_Speed(mph)           571233
Precipitation(in)        2203586
Weather_Condition         173459
Amenity                        0
Bump                           0
Crossing                       0
Give_Way                       0
Junction                       0
No_Exit                        0
Railway                        0
Roundabout                     0
Station                        0
Stop                           0
Traffic_Calming                0
Traffic_Signal                 0
Turning_Loop                   0
Sunrise_Sunset             23246
Civil_Twilight             23246
Nautical_Twilight          23246
Astronomical_Twilight      23246
dtype: int64
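Raw null counts against 7.7M rows are easier to reason about as percentages. A minimal sketch on a toy frame (the expression is identical on the full df):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df
toy = pd.DataFrame({
    'End_Lat': [39.9, np.nan, 39.1, np.nan],
    'City': ['Dayton', 'Columbus', None, 'Dayton'],
})

# Share of missing values per column, as a percentage
null_pct = toy.isnull().mean().mul(100).round(2)
print(null_pct)  # End_Lat: 50.0, City: 25.0
```

On the full dataset this makes it obvious that End_Lat/End_Lng (about 44% missing) and Precipitation(in) (about 29%) dominate the missing data, while most other columns are nearly complete.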
In [8]:
cat = ['Description','Street','City','Zipcode','Timezone','Airport_Code','Weather_Timestamp',
       'Wind_Direction','Weather_Condition','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight']
In [9]:
# Fill missing values in categorical columns with the column mode

for col in cat:
    mode_val = df[col].mode()[0]
    df[col] = df[col].fillna(mode_val)
In [10]:
numerical = ['End_Lat','End_Lng','Temperature(F)','Wind_Chill(F)','Humidity(%)','Pressure(in)','Visibility(mi)',
             'Wind_Speed(mph)','Precipitation(in)']
In [11]:
# Fill missing values in numerical columns with the column mean

for col in numerical:
    mean_val = df[col].mean()
    df[col] = df[col].fillna(mean_val)
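Mean imputation is sensitive to outliers — the summary statistics above show a Wind_Speed(mph) maximum of 1087 — so the median is a more robust alternative worth considering. A sketch on a toy series (not the notebook's original approach):

```python
import numpy as np
import pandas as pd

# Toy wind-speed column with one extreme outlier and one missing value
wind = pd.Series([4.6, 7.0, 10.4, 1087.0, np.nan])

mean_filled = wind.fillna(wind.mean())      # fill value dragged up by the outlier
median_filled = wind.fillna(wind.median())  # fill value robust to the outlier

print(mean_filled.iloc[-1], median_filled.iloc[-1])  # ≈ 277.25 and 8.7
```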
In [12]:
# Null counts after cleaning the data
df.isnull().sum()
Out[12]:
ID                       0
Source                   0
Severity                 0
Start_Time               0
End_Time                 0
Start_Lat                0
Start_Lng                0
End_Lat                  0
End_Lng                  0
Distance(mi)             0
Description              0
Street                   0
City                     0
County                   0
State                    0
Zipcode                  0
Country                  0
Timezone                 0
Airport_Code             0
Weather_Timestamp        0
Temperature(F)           0
Wind_Chill(F)            0
Humidity(%)              0
Pressure(in)             0
Visibility(mi)           0
Wind_Direction           0
Wind_Speed(mph)          0
Precipitation(in)        0
Weather_Condition        0
Amenity                  0
Bump                     0
Crossing                 0
Give_Way                 0
Junction                 0
No_Exit                  0
Railway                  0
Roundabout               0
Station                  0
Stop                     0
Traffic_Calming          0
Traffic_Signal           0
Turning_Loop             0
Sunrise_Sunset           0
Civil_Twilight           0
Nautical_Twilight        0
Astronomical_Twilight    0
dtype: int64
In [13]:
# Visualize contributing factors using Plotly
# (rename_axis/reset_index keeps column names stable across pandas versions)
signal_counts = (df['Traffic_Signal'].value_counts()
                 .rename_axis('Traffic_Signal').reset_index(name='Count'))
fig = px.bar(signal_counts,
             y='Traffic_Signal', x='Count',
             orientation='h',
             labels={'Traffic_Signal': 'Traffic Signal', 'Count': 'Count'},
             title='Distribution of Traffic Signal as a Contributing Factor')

# Show the plot
fig.show()
In [14]:
# Count occurrences of each weather condition
weather_counts = (df['Weather_Condition'].value_counts()
                  .rename_axis('Weather_Condition').reset_index(name='Count'))

# Keep only conditions with more than 5,000 recorded accidents
weather_counts = weather_counts[weather_counts['Count'] > 5000]

# Analyze patterns related to road conditions, weather, and time of day;
# here, the distribution of weather conditions

# Create a bar chart using Plotly
fig = px.bar(weather_counts,
             y='Weather_Condition', x='Count',
             orientation='h',
             labels={'Weather_Condition': 'Weather Condition', 'Count': 'Count'},
             title='Distribution of Weather Conditions')

# Show the plot
fig.show()
In [15]:
# Visualize accident hotspots (latitude and longitude) using Plotly
fig = px.scatter(df.sample(1000), x='Start_Lng', y='Start_Lat', color='Severity',
                 color_continuous_scale='viridis', opacity=0.7,
                 labels={'Start_Lng': 'Longitude', 'Start_Lat': 'Latitude', 'Severity': 'Severity'},
                 title='Accident Hotspots')

# Set layout for the figure
fig.update_layout(
    xaxis_title='Longitude',
    yaxis_title='Latitude',
    legend_title='Severity')

# Show the plot
fig.show()
In [16]:
# Convert Start_Time to datetime for time-based analysis
df['Start_Time'] = pd.to_datetime(df['Start_Time'])

# Extract date, hour, and day of the week from Start_Time
df['Date'] = df['Start_Time'].dt.date
df['Hour'] = df['Start_Time'].dt.hour
df['Day_of_Week'] = df['Start_Time'].dt.day_name()
In [17]:
# Analyze distribution of accidents by hour of the day using Plotly
hour_counts = (df['Hour'].value_counts()
               .rename_axis('Hour').reset_index(name='Count')
               .sort_values('Hour'))
fig = px.bar(hour_counts,
             x='Hour', y='Count',
             labels={'Hour': 'Hour of the Day', 'Count': 'Count'},
             title='Accident Distribution by Hour of the Day')

# Rotate x-axis labels for better readability
fig.update_layout(xaxis=dict(tickangle=45))

# Show the plot
fig.show()
In [18]:
# Analyze distribution of accidents by day of the week using Plotly
day_counts = (df['Day_of_Week'].value_counts()
              .rename_axis('Day_of_Week').reset_index(name='Count'))
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig = px.bar(day_counts,
             x='Day_of_Week', y='Count',
             labels={'Day_of_Week': 'Day of the Week', 'Count': 'Count'},
             title='Accident Distribution by Day of the Week',
             category_orders={'Day_of_Week': day_order})

# Rotate x-axis labels for better readability
fig.update_layout(xaxis=dict(tickangle=45))

# Show the plot
fig.show()

Correlation Analysis:¶

In [19]:
# Compute the correlation matrix over numeric columns only
# (pandas >= 2.0 raises if non-numeric columns are included)
correlation_matrix = df.corr(numeric_only=True)

# Plot the correlation matrix heatmap using Plotly
fig = px.imshow(correlation_matrix,
                labels=dict(color='Correlation'),
                x=correlation_matrix.columns,
                y=correlation_matrix.columns,
                title='Correlation Matrix')
fig.update_layout(width=12*80, height=16*80)

# Show the plot
fig.show()

To explore factors influencing accident severity, we'll analyze how location, weather, and time of day relate to accident severity.¶

In [20]:
# Plot accident severity distribution by time of day using Plotly
hour_severity = df.groupby(['Hour', 'Severity']).size().reset_index(name='Count')
fig = px.bar(hour_severity,
             x='Hour', y='Count', color='Severity',
             labels={'Count': 'Count', 'Severity': 'Severity'},
             title='Accident Severity by Hour of the Day',
             category_orders={'Severity': [1, 2, 3, 4]})

# Update layout to set the figure size
fig.update_layout(width=10*80, height=6*80)  # Assuming each inch is 80 pixels

# Show the plot
fig.show()
In [21]:
# Plot accident severity distribution by weather condition using Plotly
severity_by_weather = df.groupby(['Weather_Condition', 'Severity']).size().reset_index(name='Count')

# Sort values in descending order by count
severity_by_weather = severity_by_weather.sort_values(by='Count', ascending=False)

# Keep only combinations with more than 3,000 accidents
x = severity_by_weather[severity_by_weather['Count'] > 3000]

# Create the bar chart
fig = px.bar(x,
             y='Weather_Condition', x='Count', color='Severity',
             labels={'Count': 'Count', 'Severity': 'Severity'},
             title='Accident Severity by Weather Condition',
             category_orders={'Severity': [1, 2, 3, 4]})

# Update layout to set the figure size
fig.update_layout(width=12*80, height=9*80)  # Assuming each inch is 80 pixels

# Show the plot
fig.show()
In [22]:
# Plot accident severity distribution by state using Plotly
state_severity = df.groupby(['State', 'Severity']).size().reset_index(name='Count')
fig = px.bar(state_severity,
             y='State', x='Count', color='Severity',
             labels={'Count': 'Count', 'Severity': 'Severity'},
             title='Accident Severity by State',
             category_orders={'Severity': [1, 2, 3, 4]})

fig.update_layout(width=12*80, height=16*80)

# Show the plot
fig.show()

Hourly Trend Analysis:¶

For the time series analysis, we'll look at trends and patterns in accidents over time, focusing on temporal aspects such as the hour of the day and the month of the year.

In [23]:
# Grouping by hour and counting accidents
hourly_accidents = df.groupby('Hour').size().reset_index(name='Number_of_Accidents')

# Plotting the hourly trend of accidents using Plotly
fig = px.line(hourly_accidents, x='Hour', y='Number_of_Accidents',
              markers=True, title='Hourly Trend of Accidents',
              labels={'Hour': 'Hour of the Day', 'Number_of_Accidents': 'Number of Accidents'})

# Update layout for better readability
fig.update_layout(xaxis=dict(tickmode='linear', tick0=0, dtick=1),
                  width=12*80, height=6*80)  # Assuming each inch is 80 pixels

# Show the plot
fig.show()
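The hourly grouping above generalizes directly to months, which the conclusion's remarks on seasonality rely on. A sketch on synthetic timestamps (it assumes Start_Time has already been parsed to datetime, as in cell 16):

```python
import pandas as pd

# Synthetic stand-in for df['Start_Time'] after pd.to_datetime
toy = pd.DataFrame({'Start_Time': pd.to_datetime([
    '2021-01-05 08:00', '2021-01-20 17:30',
    '2021-02-11 09:15', '2021-12-24 18:45', '2021-12-31 07:10',
])})

# Count accidents per calendar month
toy['Month'] = toy['Start_Time'].dt.month
monthly = toy.groupby('Month').size().reset_index(name='Number_of_Accidents')
print(monthly)  # months 1, 2, 12 with counts 2, 1, 2
```

On the full frame, the resulting monthly table can be fed to px.line exactly like hourly_accidents above.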

Multivariate Analysis using Heatmap¶

In [24]:
# Select relevant columns for multivariate analysis
selected_columns = ['Weather_Condition', 'Visibility(mi)', 'Severity']

# Take an explicit copy and drop rows with missing values in one step
# (calling dropna(inplace=True) on a slice triggers SettingWithCopyWarning)
subset_df = df[selected_columns].dropna()

# Creating a heatmap to explore relationships using Plotly
heatmap_data = subset_df.groupby(['Weather_Condition', 'Visibility(mi)'])['Severity'].mean().reset_index()

# px.imshow reads axis labels from the pivot's index/columns directly,
# which avoids any ordering mismatch with .unique()
heatmap_pivot = heatmap_data.pivot(index='Weather_Condition', columns='Visibility(mi)', values='Severity')
fig = px.imshow(heatmap_pivot,
                labels=dict(color='Severity'),
                title='Relationships between Weather, Visibility, and Severity')

# Set layout for better readability
fig.update_layout(
    xaxis_title='Visibility (mi)',
    yaxis_title='Weather Condition')

# Update layout to set the figure size
fig.update_layout(width=12*80, height=19*80)  # Assuming each inch is 80 pixels

# Show the plot
fig.show()

Conclusion¶

Based on the analysis of the US Accidents dataset, several insights and patterns have been identified:

Severity Analysis:

Accidents occur across all four severity levels, with Severity 2 being by far the most common. Severity varies with factors such as weather conditions, road conditions, and visibility.

Temporal Patterns:

Accidents show distinct patterns over the day, with higher frequencies during peak commuting hours, particularly in the afternoon. Monthly analysis shows variations, with an increase in accidents during the winter months.

Geospatial Analysis:

Accidents are distributed unevenly across states, with markedly higher accident counts in certain regions.

Weather and Road Conditions:

Adverse weather conditions such as rain and snow are associated with higher accident severity, as are poor road conditions.

Contributing Factors:

Traffic signals and crossings play a role in accident severity; accidents near crossings are often severe.

Multivariate Analysis:

Specific combinations of weather condition, visibility, and road condition jointly affect accident severity.

Time Series Analysis:

Accidents exhibit a clear hourly trend, peaking at certain hours of the day, and the accident count varies by month, suggesting seasonality.

These insights can be used to strengthen road safety measures, optimize traffic management, and design targeted interventions to reduce both the severity and the frequency of accidents.

In [ ]: